CCDL 2020

Objective for this notebook analysis:

We’ll use the same gene expression dataset we used in the previous notebook. It is a pre-processed astrocytoma microarray dataset that we performed a set of differential expression analyses on.

More tidyverse resources:
- R for Data Science
- tidyverse documentation
- Cheatsheet of tidyverse data transformation
- Online tidyverse book chapter

Set Up

The tidyverse is a collection of packages that are handy for general data wrangling, analysis, and visualization. Other packages that are specifically handy for different biological analyses are found on Bioconductor. If we want to use a package’s functions we first need to install the package.

Our RStudio Server already has the tidyverse group of packages installed for you. But if you needed to install it or other packages available on CRAN, you would do so using the install.packages() function like this: install.packages("tidyverse").

library(tidyverse)

Referencing a library’s function with ::

Note that if we had not imported the tidyverse set of packages using library() like above, and we wanted to use a tidyverse function like read_tsv(), we would need to tell R what package to find this function in. To do this, we would use :: to tell R to load in this function from the readr package by using readr::read_tsv(). You will see this :: method of referencing functions within packages throughout the course. We like to use it in part to remove any ambiguity in which version of a function we are using; it is not too uncommon for different packages to use the same name for very different functions!
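
As a small illustration of the name clash mentioned above (a sketch that is not part of the original notebook): the `stats` package that ships with R has its own `filter()` function, and it does something entirely different from `dplyr::filter()`. The vector `x` below is made up for this example.

```r
# A made-up numeric vector, used only for this illustration
x <- c(1, 2, 3, 4, 5)

# `stats::filter()` applies a linear filter to a numeric series.
# Here it computes a centered three-point moving sum.
# It has nothing to do with keeping rows of a data frame, which is
# what `dplyr::filter()` does.
moving_sum <- stats::filter(x, rep(1, 3))

# Drop the time series attributes to see the plain numbers
as.numeric(moving_sum)
# [1] NA  6  9 12 NA
```

Writing `stats::filter()` or `dplyr::filter()` in full means the reader never has to guess which of the two is being called.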

Managing directories

Before we can import the data we need, we should double check where R is looking for files, aka the current working directory. We can do this by using the getwd() function, which will tell us what folder we are in.

# Let's check what directory we are in:
getwd()
[1] "/Users/ccdl/Desktop/cloned_repos/training-modules/intro-to-R-tidyverse"

For Rmd files, the working directory is wherever the file is located, but commands executed in the console may have a different working directory.

We will want to make a directory for our output and we will call this directory: results. But before we create the directory, we should check if it already exists. We will show two ways that we can do this.

First, we can use the dir() function to have R list the files in our working directory.

# Let's check what files are here
dir()
 [1] "00a-rstudio_guide.md"                      
 [2] "00b-debugging_resources.md"                
 [3] "00c-good-scientific-coding-practices.md"   
 [4] "01-intro_to_base_R-live.Rmd"               
 [5] "01-intro_to_base_R.nb.html"                
 [6] "01-intro_to_base_R.Rmd"                    
 [7] "01b-intro_to_base_R_exercise.Rmd"          
 [8] "02-intro_to_ggplot2-live.Rmd"              
 [9] "02-intro_to_ggplot2.nb.html"               
[10] "02-intro_to_ggplot2.Rmd"                   
[11] "03-intro_to_tidyverse-live.nb.html"        
[12] "03-intro_to_tidyverse-live.Rmd"            
[13] "03-intro_to_tidyverse.nb.html"             
[14] "03-intro_to_tidyverse.Rmd"                 
[15] "04a-intro_to_R_exercise.Rmd"               
[16] "04b-intro_to_tidyverse_exercise-part-1.Rmd"
[17] "04c-intro_to_tidyverse_exercise-part-2.Rmd"
[18] "data"                                      
[19] "diagrams"                                  
[20] "exercise-results"                          
[21] "plots"                                     
[22] "README.md"                                 
[23] "results"                                   
[24] "screenshots"                               
[25] "scripts"                                   

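If “results” does not appear in this list, there is no folder with that name yet. (In the output shown here it is already listed, because the notebook had been run before.)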

If we want to more pointedly look for “results” in our working directory we can use the dir.exists() function.

# Check if the results directory exists
dir.exists("results")
[1] TRUE

If the above says FALSE that means we will need to create a results directory using the function dir.create().

# Make a directory within the working directory called 'results'
dir.create("results")
Warning in dir.create("results"): 'results' already exists

After creating the results directory above, let’s re-run dir.exists() to see if now it exists.

# Re-check if the results directory exists
dir.exists("results")
[1] TRUE

We can use the output of dir.exists() to automatically create or hold off on creating a directory by putting this together in an if statement like below. An if statement has two main parts: First, the test, which is an expression that will result in either TRUE or FALSE. This is put in parentheses immediately after the if. The next part is the body, which is the commands that will be executed if the test is TRUE. These are placed within a set of braces { }. Note that we used an exclamation point in the test to signify that we want a directory to be created only if dir.exists("results") is NOT equal to TRUE.

# If 'results' directory doesn't exist...
if (!dir.exists("results")) {
  # ... create a 'results' directory
  dir.create("results")
}
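
To see that this pattern is safe to run more than once, here is a minimal sketch that works inside R's temporary directory so nothing in the project is touched. The name `demo_results` is hypothetical and used only for this illustration.

```r
# Build a path inside the session's temporary directory
demo_dir <- file.path(tempdir(), "demo_results")

# Start from a clean slate in case the directory is left over from before
unlink(demo_dir, recursive = TRUE)
existed_before <- dir.exists(demo_dir)  # FALSE: nothing there yet

# First pass: the test is TRUE, so the body runs and creates the directory
if (!dir.exists(demo_dir)) {
  dir.create(demo_dir)
}
exists_after_first <- dir.exists(demo_dir)  # TRUE

# Second pass: the test is now FALSE, so the body is skipped and
# `dir.create()` never gets the chance to warn that the directory exists
if (!dir.exists(demo_dir)) {
  dir.create(demo_dir)
}
```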

The dir.exists() function will not work on files themselves. In that case, there is an analogous function called file.exists().
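
The difference between the two functions can be sketched with a throwaway file. The temporary file below is created only for this illustration and is not one of the workshop's data files.

```r
# Create a small temporary file to test against
demo_file <- tempfile(fileext = ".tsv")
writeLines("gene\tvalue", demo_file)

# `file.exists()` finds the file
file_found <- file.exists(demo_file)  # TRUE

# `dir.exists()` reports FALSE because the path is a file, not a directory
dir_found <- dir.exists(demo_file)    # FALSE
```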

Try using the file.exists() function to see if the file gene_results_GSE44971.tsv exists in the current directory. Use the code chunk we set up for you below. Note that in our notebooks (and sometimes elsewhere), wherever you see a <FILL_IN_THE_BLANK> like in the chunk below, that is meant for you to replace (including the angle brackets) with the correct phrase before you run the chunk (otherwise you will get an error).

# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)

Now that we’ve determined that gene_results_GSE44971.tsv exists, we are ready to read it into our R environment.

Read a TSV file

Declare the name of the directory where we will read in the data.

data_dir <- "data"

Although base R has functions to read in data files, the functions in the readr package (part of the tidyverse) are faster and more straightforward to use so we are going to use those here. Because the file we are reading in is a TSV (tab separated values) file we will be using the read_tsv() function. There are analogous functions for CSV (comma separated values) files (read_csv()) and other file types.

Read in the differential expression analysis results file

stats_df <- readr::read_tsv(
  file.path(data_dir,
            "gene_results_GSE44971.tsv")
  )
Parsed with column specification:
cols(
  ensembl_id = col_character(),
  gene_symbol = col_character(),
  contrast = col_character(),
  log_fold_change = col_double(),
  avg_expression = col_double(),
  t_statistic = col_double(),
  p_value = col_double(),
  adj_p_value = col_double()
)

Following the template of the previous chunk, use this chunk to read in the file GSE44971.tsv that is in the data folder and save it in the variable gene_df.

# Use this chunk to read in data from the file `GSE44971.tsv`
gene_df <- readr::read_tsv(
  file.path(data_dir,
            "GSE44971.tsv")
  )
Parsed with column specification:
cols(
  .default = col_double(),
  Gene = col_character()
)
See spec(...) for full column specifications.

Use this chunk to explore what gene_df looks like.

# Explore `gene_df`

What information is contained in gene_df?

dplyr pipes

One nifty feature of the tidyverse is pipes: %>%. These handy things allow you to funnel the result of one expression to the next, making your code a little more streamlined.

For example, the output from this:

filter(stats_df, contrast == "male_female")

…is the same as the output from this:

stats_df %>% filter(contrast == "male_female")

This can make your code cleaner and make a series of related commands easier to follow. Let’s look at an example with our stats data frame of how the same functions look with or without pipes:

Example 1: without pipes:

stats_arranged <- arrange(stats_df, t_statistic)
stats_filtered <- filter(stats_arranged, avg_expression > 50)
stats_nopipe <- select(stats_filtered, contrast, log_fold_change, p_value)

UGH, we have to keep track of all of those different intermediate data frames and type their names so many times here! We could maybe streamline things by using the same variable name at each stage, but even then there is a lot of extra typing, and it is easy to get confused about what has been done where. It’s annoying and makes it harder for people to read.

Example 2: Same result as 1 but with pipes!

# Example of the same modifications as above but with pipes!
stats_pipe  <- stats_df %>%
               arrange(t_statistic) %>%
               filter(avg_expression > 50) %>%
               select(contrast, log_fold_change, p_value)

What the %>% (pipe) is doing here is feeding the result of the expression on its left into the first argument of the next function (to its right, or on the next line here). We can then skip that first argument (the data in these cases), and move right on to the part we care about at that step: what we are arranging, filtering, or selecting in this case.

Let’s double check that these are the same by using the function, all.equal().

all.equal(stats_nopipe, stats_pipe)
[1] TRUE

all.equal() is letting us know that these two objects are the same.
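
The rule the pipe follows, `x %>% f(y)` being the same as `f(x, y)`, can also be checked on a tiny data frame. This is a sketch on made-up data; `toy_df` and its values are assumptions for illustration only.

```r
library(dplyr)

# A made-up data frame standing in for a much larger one
toy_df <- data.frame(
  gene = c("A", "B", "C"),
  avg_expression = c(10, 75, 120)
)

# Without a pipe: the data frame is written as the first argument
nested <- filter(toy_df, avg_expression > 50)

# With a pipe: the data frame on the left is supplied as the first argument
piped <- toy_df %>% filter(avg_expression > 50)

all.equal(nested, piped)
# [1] TRUE
```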

Now that hopefully you are convinced that the tidyverse can help you make your code neater and easier to use and read, let’s go through some of the popular tidyverse functions so we can create pipelines like this.

Common tidyverse functions

Let’s say we wanted to filter this gene expression dataset to particular sample groups. In order to do this, we would use the function filter() as well as a logic statement (usually one that refers to a column or columns in the data frame).

# Here let's filter stats_df to the gene_symbol "SNCA"
stats_df %>% 
  filter(gene_symbol == "SNCA")

We can use filter() similarly for numeric statements.

# Here let's filter the data to rows with average expression values above 50
stats_df %>%
  filter(avg_expression > 50)

We can apply multiple filters at once, which will require all of them to be satisfied for every row in the results:

# filter to highly expressed genes with contrast "male_female"
stats_df %>%
  filter(contrast == "male_female", 
         avg_expression > 50)

When we are filtering, the %in% operator can come in handy if we have multiple items we would like to match. Let’s take a look at what using %in% does.

genes_of_interest <- c("SNCA", "CDKN1A")
stats_df$gene_symbol %in% genes_of_interest

%in% returns a logical vector that now we can use in dplyr::filter.

# filter to genes of interest
stats_df %>% 
  filter(gene_symbol %in% c("SNCA", "CDKN1A"))
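
On a small scale, the logical vector that `%in%` produces looks like this. The `gene_symbols` vector is made up for the example and stands in for a column such as `stats_df$gene_symbol`.

```r
# A made-up vector of gene symbols
gene_symbols <- c("TP53", "SNCA", "GAPDH", "CDKN1A")
genes_of_interest <- c("SNCA", "CDKN1A")

# One TRUE or FALSE per element on the left-hand side
is_of_interest <- gene_symbols %in% genes_of_interest
is_of_interest
# [1] FALSE  TRUE FALSE  TRUE

# The logical vector can be used to keep only the matching elements
gene_symbols[is_of_interest]
# [1] "SNCA"   "CDKN1A"
```

`filter()` uses the same logical vector to decide which rows to keep.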

Let’s return to our first filter() and build on to it. This time, let’s keep only some of the columns from the data frame using the select() function. Let’s also save this as a new data frame called stats_filtered_df.

# filter to highly expressed "male_female"
# and select gene_symbol, log_fold_change and t_statistic
stats_filtered_df <- stats_df %>%
  filter(contrast == "male_female", 
         avg_expression > 50) %>%
  select(gene_symbol, log_fold_change, t_statistic)

Let’s say we wanted to arrange this dataset so that the genes are arranged by the smallest p values to the largest. In order to do this, we would use the function arrange() as well as the column we would like to sort by (in this case p_value).

stats_df %>% 
  arrange(p_value) 

What if we want to sort from largest to smallest, for example to see the genes with the highest average expression? We can use the same function, but wrap the column name in the desc() function; this time we sort by the avg_expression column.

# arrange descending by avg_expression
stats_df %>%
  arrange(desc(avg_expression))

What if we would like to create a new column of values? For that we use the mutate() function.

stats_df %>% 
  mutate(log10_p_value = -log10(p_value))

What if we want to obtain summary statistics for a column or columns? The summarize() function allows us to calculate summary statistics for a column. Here we will use summarize() to obtain the mean log fold change over all the genes, and its standard deviation.

stats_df %>% 
  summarize(mean(log_fold_change),
            sd(log_fold_change))

What if we’d like to obtain summary statistics but have them for various groups? Conveniently named, there’s a function called group_by() that seamlessly allows us to do this. Also note that group_by() allows us to group by multiple variables at a time if we want to.

stats_summary_df <- stats_df %>%
      group_by(contrast) %>% 
      summarize(mean(log_fold_change),
                sd(log_fold_change))

Let’s look at a preview of what we made:

stats_summary_df

Here we have the mean log fold change, and its standard deviation, for each contrast we made.
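
When summary columns are not given names, dplyr labels them with the expression itself, such as `mean(log_fold_change)`, which is awkward to refer to later. A minimal sketch of naming them explicitly follows; the data frame `toy_stats` and the column names `mean_lfc` and `sd_lfc` are made up for this illustration.

```r
library(dplyr)

# A made-up data frame with two contrasts
toy_stats <- data.frame(
  contrast = c("a_b", "a_b", "c_d", "c_d"),
  log_fold_change = c(1, 3, -2, 2)
)

# `name = expression` inside summarize() sets the output column name
toy_summary <- toy_stats %>%
  group_by(contrast) %>%
  summarize(
    mean_lfc = mean(log_fold_change),
    sd_lfc = sd(log_fold_change)
  )

names(toy_summary)
# [1] "contrast" "mean_lfc" "sd_lfc"
```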

A brief intro to the apply family of functions

In base R, the apply family of functions can be an alternative method for performing transformations across a data frame, matrix or other object structures.

One of this family is (shockingly) the function apply(), which operates on matrices.

A matrix is similar to a data frame in that it is a rectangular table of data, but it has an additional constraint: rather than each column having a type, ALL data in a matrix has the same type.

The first argument to apply() is the data object we want to work on. The second argument specifies whether we are applying the function across rows or across columns (1 for rows, 2 for columns). The third argument is the function we will apply to each row or column of the data object.
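
The three arguments can be seen at work on a matrix small enough to check by hand. This is a sketch on a made-up matrix, not on the workshop data.

```r
# A made-up 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
toy_matrix <- matrix(1:6, nrow = 2)

# Second argument 1: apply `sum` to each row, giving one value per row
row_sums <- apply(toy_matrix, 1, sum)
row_sums
# [1]  9 12

# Second argument 2: apply `sum` to each column, giving one value per column
col_sums <- apply(toy_matrix, 2, sum)
col_sums
# [1]  3  7 11
```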

Remember that gene_df is a gene x sample gene expression data frame that has columns of two different types, character and numeric, but converting it to a matrix will require us to make them all the same type. We can still coerce it into a matrix using as.matrix(), in which case R will pick a type that it can convert everything to. What does it choose?

# Coerce `gene_df` into a matrix
gene_matrix <- as.matrix(gene_df)
# Explore the structure of the `gene_matrix` object
str(gene_matrix)
 chr [1:20056, 1:59] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:59] "Gene" "GSM1094814" "GSM1094815" "GSM1094816" ...

While that worked, it is rare that we want numbers converted to text, so we are going to select only the columns with numeric values. We can do this most easily by removing the first column, which contains character values.

# Let's save a new matrix object named `gene_num_matrix` containing only
# the numeric values
gene_num_matrix <- as.matrix(gene_df[, -1])

# Explore the structure of the `gene_num_matrix` object
str(gene_num_matrix)
 num [1:20056, 1:58] 9.5951 -0.0436 8.5246 1.6013 0.6189 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:58] "GSM1094814" "GSM1094815" "GSM1094816" "GSM1094817" ...

Why do we have a [, -1] after gene_df in the above chunk?

Now that the matrix is all numbers, we can do things like calculate the column or row statistics using apply().

# Calculate row means
gene_means <- apply(gene_num_matrix, 1, mean) # Notice we are using 1 here

# How long will `gene_means` be? 
length(gene_means)
[1] 20056

Now let’s investigate the same set up, but use 2 to apply over the columns of our matrix.

# Calculate sample means
sample_means <- apply(gene_num_matrix, 2, mean) # Notice we use 2 here

# How long will `sample_means` be? 
length(sample_means)
[1] 58

We can put the gene names back into the numeric matrix object by assigning them as rownames.

# Assign the gene names from gene_df$Gene to the `gene_num_matrix` object using
# the `rownames()` function
rownames(gene_num_matrix) <- gene_df$Gene

# Explore the `gene_num_matrix` object
head(gene_num_matrix)
                 GSM1094814 GSM1094815 GSM1094816   GSM1094817  GSM1094818
ENSG00000000003  9.59510150  8.4785070 12.6802129  8.677614838 10.75552946
ENSG00000000005 -0.04361838 -0.1307889  0.5345931 -0.005805166 -0.05430255
ENSG00000000419  8.52458571  9.8405725 11.9923201  9.639163317 10.03349327
ENSG00000000457  1.60130552  1.8895554  1.3747388  1.637826214  1.63562493
ENSG00000000460  0.61891285  0.5321708  0.4805598  0.617947976  0.70636135
ENSG00000000938  0.55573058  0.9942862  1.8030176  1.237317457  0.84152852
                 GSM1094819  GSM1094820 GSM1094821  GSM1094822 GSM1094823
ENSG00000000003  6.37470691  9.10028584  7.3546860  8.51847190  9.4216113
ENSG00000000005 -0.04831174  0.01411359 -0.1108279 -0.02625776 -0.1692604
ENSG00000000419 12.78335826 10.75552946  9.1711113  9.30210174  9.4915415
ENSG00000000457  1.46586071  1.79852032  1.6389259  1.80586748  1.5813979
ENSG00000000460  0.77224572  0.89607132  0.6740559  0.63157954  0.7480556
ENSG00000000938  3.32404606  0.81562856  0.9728617  0.77129700  1.3596402
                GSM1094824  GSM1094825 GSM1094826  GSM1094827 GSM1094828
ENSG00000000003  5.0239629  7.89737460  8.1126876  7.03444640  9.6984918
ENSG00000000005 -0.1359247 -0.08624286 -0.2044839 -0.09037887 -0.1602416
ENSG00000000419 11.8835897 10.88079782  9.9174930 10.41753701 10.2695503
ENSG00000000457  1.5525410  1.92489254  1.8046590  1.50382159  1.6198069
ENSG00000000460  0.7072273  0.89196068  0.8223559  0.61970982  0.6776549
ENSG00000000938  0.8758421  0.62191515  0.7675971  0.92791338  1.0351067
                 GSM1094829 GSM1094830 GSM1094831 GSM1094832  GSM1094833
ENSG00000000003 13.98689230 10.5868331  7.6836223  8.3862587 11.18932763
ENSG00000000005 -0.05038705  0.3096031 -0.1551062 -0.1938994 -0.08369537
ENSG00000000419 10.12104053  9.3653576 10.2184110  9.5951015 12.58713982
ENSG00000000457  1.67741832  1.5762471  2.0663493  1.7504928  1.67321632
ENSG00000000460  0.71786250  0.4991620  0.7912559  0.8103023  0.94248698
ENSG00000000938  0.82152165  0.6556572  0.9782599  0.6568353  0.73458782
                GSM1094834 GSM1094835  GSM1094836 GSM1094837 GSM1094838
ENSG00000000003  9.7562003  9.6984918 10.56891510  9.9391025  7.8738131
ENSG00000000005 -0.0437601 -0.1120755 -0.08208306 -0.2067112 -0.1211891
ENSG00000000419  9.9799646 10.0798974  9.59510150  9.7417927 10.2105827
ENSG00000000457  1.3778594  1.5630889  1.74146532  1.6518036  1.7806133
ENSG00000000460  0.6201925  0.5570300  0.70084983  0.7137118  0.7355154
ENSG00000000938  0.7674132  1.2165228  0.60856106  0.6041645  1.0624067
                GSM1094839  GSM1094840 GSM1094841 GSM1094842 GSM1094843
ENSG00000000003  8.6311353  8.58077557  9.1579585  6.3317019 10.1939387
ENSG00000000005 -0.1109070 -0.03963564 -0.1148106 -0.1137150 -0.1645040
ENSG00000000419 10.1517344 11.18932763 10.5775132 13.3760971 12.2271693
ENSG00000000457  1.7281446  1.70242287  1.6503090  1.2990208  1.5687866
ENSG00000000460  0.5284219  0.67643466  0.7353276  0.6223990  0.5646406
ENSG00000000938  0.6469084  0.88120942  0.5230414  0.9909517  0.8484174
                 GSM1094844  GSM1094845  GSM1094846 GSM1094847  GSM1094848
ENSG00000000003 10.44364159  9.62435722 16.05075944  6.9334508  8.55180910
ENSG00000000005 -0.03132427 -0.01400534 -0.03529112  0.1268899 -0.03857382
ENSG00000000419 10.28571947 11.47682424  9.88540523  8.9646682 11.10911330
ENSG00000000457  1.66150175  1.62312829  1.37320729  1.3402742  0.81703931
ENSG00000000460  0.66798926  0.58089659  0.46957607  0.4222455  0.29500657
ENSG00000000938  0.53726484  1.08997535  0.91859664  0.8170393  1.92489254
                 GSM1094849 GSM1094850 GSM1094851  GSM1094852  GSM1094853
ENSG00000000003  9.29497760  7.5027098  6.9593119  8.33588532  8.16826110
ENSG00000000005 -0.09269777 -0.1712545  0.6359455 -0.04951916 -0.05576644
ENSG00000000419 11.16941260 14.4432389  9.5329440 11.50854450  9.46361675
ENSG00000000457  1.45557567  1.4528744  1.1872029  1.31744971  1.20863565
ENSG00000000460  0.43611482  0.3623008  0.5544340  0.41423708  0.33909186
ENSG00000000938  1.18025639  1.5680560  1.6256345  0.86704140  1.48853231
                GSM1094854 GSM1094855 GSM1094856 GSM1094857  GSM1094858
ENSG00000000003  9.8020077 7.92580451  8.5122426  8.5300217  6.45774124
ENSG00000000005 -0.1606687 0.02223393 -0.1340762  0.0143126 -0.02043163
ENSG00000000419 11.8011262 8.66606873  8.6484013  8.7775212  9.88540523
ENSG00000000457  1.4582023 1.40127133  1.2111514  1.3356778  1.78911533
ENSG00000000460  0.8086875 0.44922354  0.5992390  0.3759818  0.35878489
ENSG00000000938  2.2089847 0.75746472  0.7677913  0.8519271  1.05184938
                 GSM1094859  GSM1094860 GSM1094861 GSM1094862 GSM1094863
ENSG00000000003  8.06834906  6.29704946 9.59510150  8.2198571  6.0207988
ENSG00000000005 -0.08355142 -0.04656716 0.03031249 -0.1646146 -0.1405284
ENSG00000000419  8.67761484 14.59843763 8.03382884 11.0881792 10.1123251
ENSG00000000457  1.45785045  1.19855713 1.31224455  1.1589472  1.5021307
ENSG00000000460  0.62352156  0.64477354 0.30380735  0.3217531  0.3130842
ENSG00000000938  0.70757412  1.78181966 0.78496368  1.5556700  0.6950488
                 GSM1094864  GSM1094865  GSM1094866  GSM1094867  GSM1094868
ENSG00000000003 10.60783176 10.23536609 9.031964212  7.66629540  1.06494502
ENSG00000000005  0.02602656 -0.07968734 0.007326678 -0.09030723 -0.09543977
ENSG00000000419  9.68412621 12.79993557 9.794484639 10.23536609 12.97072224
ENSG00000000457  1.47876418  1.46084185 1.585389962  1.59111604  1.76232531
ENSG00000000460  0.57717175  0.64888234 0.777446911  0.64510940  0.11986491
ENSG00000000938  0.54561199  0.59862564 0.479774397  0.37790623  0.83880649
                GSM1094869 GSM1094870 GSM1094871
ENSG00000000003  1.0408332  1.7262079  1.0292255
ENSG00000000005 -0.1023734 -0.1537910 -0.1100603
ENSG00000000419  8.6196300 11.9588188 10.8900846
ENSG00000000457  1.5755541  1.7445362  1.6275308
ENSG00000000460  0.2372586  0.3456454  0.1885289
ENSG00000000938  0.7200364  0.9199667  0.7240255

Row names like this can be very convenient for keeping matrices organized, but row names (and column names) can be lost or misordered if you are not careful, especially during input and output, so treat them with care.
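
One reason row names are convenient is that they travel with the results of functions like `apply()` and can be used to pull out rows by name. Here is a minimal sketch on a made-up matrix; the gene and sample labels are hypothetical.

```r
# A made-up matrix with row and column names set at creation
toy_matrix <- matrix(
  c(9.6, 8.5,
    -0.04, -0.13),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    c("gene_a", "gene_b"),
    c("sample_1", "sample_2")
  )
)

# Pull out a row by its name rather than its position
gene_b_values <- toy_matrix["gene_b", ]

# Row names carry over to the result of a row-wise apply()
gene_means <- apply(toy_matrix, 1, mean)
names(gene_means)
# [1] "gene_a" "gene_b"
```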

Although the apply functions may not be as easy to use as the tidyverse functions, for some applications, apply methods can be better suited. In this workshop, we will not delve too deeply into the various other apply functions (tapply(), lapply(), etc.) but you can read more information about them here.

The dplyr::join functions

Let’s say we have a scenario where we have two data frames that we would like to combine. Recall that stats_df and gene_df are data frames that contain information about some of the same genes. The dplyr::join family of functions are useful for various scenarios of combining data frames.

For now, we will focus on inner_join(), which will combine data frames by only keeping information about matching rows that are in both data frames. We need to use the by argument to designate what column(s) should be used as a key to match the data frames. In this case we want to match the gene information between the two, so we will specify that we want to compare values in the ensembl_id column from stats_df to the Gene column from gene_df.

stats_df %>% 
  inner_join(gene_df, by = c('ensembl_id' = 'Gene')) 
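
What "only keeping matching rows" means is easier to see on two tiny data frames. This is a sketch on made-up data; `toy_stats`, `toy_expr`, and their values are assumptions for illustration.

```r
library(dplyr)

# Made-up statistics for genes G1, G2 and G3
toy_stats <- data.frame(
  ensembl_id = c("G1", "G2", "G3"),
  log_fold_change = c(0.5, -1.2, 2.0)
)

# Made-up expression values for genes G2, G3 and G4
toy_expr <- data.frame(
  Gene = c("G2", "G3", "G4"),
  sample_1 = c(7.1, 3.3, 9.8)
)

# Match `ensembl_id` in the left data frame to `Gene` in the right one
joined <- toy_stats %>%
  inner_join(toy_expr, by = c("ensembl_id" = "Gene"))

joined$ensembl_id
# [1] "G2" "G3"
```

G1 and G4 are dropped because each appears in only one of the two data frames, and the key column keeps its name from the left-hand data frame.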

Save data to files

Save to TSV files

Let’s write some of the data frames we created to a file. To do this, we can use the readr family of write_*() functions. The first argument of write_tsv() is the data we want to write, and the second argument is a character string that describes the path to the new file we would like to create. Remember that we created a results directory to put our output in; because this is not our working directory itself, we need to include it in the path. This is what we will use the file.path() function for. Let’s look in a bit more detail at what file.path() does, by examining the results of the function in the examples below.

# Which of these file paths is what we want to use to save our data to the
# results directory we created at the beginning of this notebook?
file.path("docker-install", "stats_summary.tsv")
[1] "docker-install/stats_summary.tsv"
file.path("results", "stats_summary.tsv")
[1] "results/stats_summary.tsv"
file.path("stats_summary.tsv", "results")
[1] "stats_summary.tsv/results"

Replace <NEW_FILE_PATH> below with the file.path() statement from above that will successfully save our file to the results folder

# Write our data frame to a TSV file
readr::write_tsv(stats_summary_df, <NEW_FILE_PATH>)

Check in your results directory to see if your new file has successfully saved.

Save to RDS files

For this example we have been working with data frames, which are conveniently represented as TSV or CSV tables. However, in other situations we may want to save more complicated or very large data structures, and RDS (R Data Serialized/Single) files may be a better option for saving our data. RDS is R’s special file format for holding data exactly as you have it in your R environment. RDS files can also be compressed, meaning they will take up less space on your computer. Let’s save our data to an RDS file in our results folder. You will need to replace the .tsv with .RDS, but you can use what we determined as our file path for the last chunk as your template.

# Write your object to an RDS file
readr::write_rds(stats_summary_df, <PUT_CORRECT_FILE_PATH_HERE>)

Read an RDS file

Now that you have learned the readr functions read_tsv(), write_tsv(), and write_rds(), what do you suppose the function you will need to read your RDS file is called? Use that function here to re-import your data in the chunk we set up for you below.

# Read in your RDS file
reimport_df <- <PUT_FUNCTION_NAME>(file.path("results", "stats_summary.RDS"))
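
To see what holding data "exactly as you have it" means in practice, here is a minimal sketch that uses the base R counterparts `saveRDS()` and `readRDS()` rather than the readr functions used in this notebook. The object `toy_list` is made up for the example, and the file is written to a temporary location.

```r
# A made-up list mixing several kinds of R objects that would not fit
# neatly into a single TSV table
toy_list <- list(
  ids = c("G1", "G2"),
  scores = matrix(1:4, nrow = 2),
  group = factor(c("case", "control"))
)

# Save to a temporary RDS file and read it back in
rds_path <- tempfile(fileext = ".RDS")
saveRDS(toy_list, rds_path)
restored <- readRDS(rds_path)

# The restored object matches the original, including the matrix and factor
identical(toy_list, restored)
# [1] TRUE
```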

As is good practice, we will end this session by printing out our session info.

Session Info

# Print out the versions and packages we are using in this session
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.4    
[5] tidyr_0.8.3     tibble_3.0.1    ggplot2_3.2.1   tidyverse_1.2.1
[9] readr_1.3.1    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6     cellranger_1.1.0 pillar_1.4.4     compiler_3.6.1  
 [5] base64enc_0.1-3  tools_3.6.1      digest_0.6.25    lubridate_1.7.4 
 [9] jsonlite_1.6     evaluate_0.14    lifecycle_0.2.0  nlme_3.1-141    
[13] gtable_0.3.0     lattice_0.20-38  pkgconfig_2.0.3  rlang_0.4.6     
[17] cli_2.0.2        rstudioapi_0.11  yaml_2.2.1       haven_2.1.1     
[21] xfun_0.14        withr_2.2.0      xml2_1.2.2       httr_1.4.1      
[25] knitr_1.28       generics_0.0.2   vctrs_0.3.1      hms_0.5.3       
[29] rprojroot_1.3-2  grid_3.6.1       tidyselect_0.2.5 glue_1.4.1      
[33] R6_2.4.1         fansi_0.4.1      readxl_1.3.1     rmarkdown_1.14  
[37] modelr_0.1.5     magrittr_1.5     scales_1.0.0     backports_1.1.7 
[41] ellipsis_0.3.1   htmltools_0.3.6  rvest_0.3.4      assertthat_0.2.1
[45] colorspace_1.4-1 labeling_0.3     stringi_1.4.6    lazyeval_0.2.2  
[49] munsell_0.5.0    broom_0.5.2      crayon_1.3.4    
---
title: "Introduction to tidyverse"
output:   
  html_notebook: 
    toc: true
    toc_float: true
editor_options: 
  chunk_output_type: inline
---

**CCDL 2020**

## Objective for this notebook analysis: 

We'll use the same gene expression dataset we used in the [previous notebook](./02-intro_to_ggplot2.Rmd).
It is a pre-processed [astrocytoma microarray dataset](https://www.refine.bio/experiments/GSE44971/gene-expression-data-from-pilocytic-astrocytoma-tumour-samples-and-normal-cerebellum-controls) 
that we performed a set of [differential expression analyses on](./scripts/00-setup-intro-to-R.R).  

**More tidyverse resources:**  
- [R for Data Science](https://r4ds.had.co.nz/)  
- [tidyverse documentation](https://dplyr.tidyverse.org/)  
- [Cheatsheet of tidyverse data transformation](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)  
- [Online tidyverse book chapter](https://privefl.github.io/advr38book/tidyverse.html)  

## Set Up

The tidyverse is a collection of packages that are handy for general data 
wrangling, analysis, and visualization. 
Other packages that are specifically handy for different biological analyses are 
found on [Bioconductor](https://www.bioconductor.org/).
If we want to use a package's functions we first need to install the package.

Our RStudio Server already has the `tidyverse` group of packages installed for you. 
But if you needed to install it or other packages available on CRAN, you 
would do so using the `install.packages()` function like this: 
`install.packages("tidyverse")`.

```{r tidyverse}
library(tidyverse)
```

### Referencing a library's function with `::`

Note that if we had not imported the tidyverse set of packages using `library()` 
like above, and we wanted to use a tidyverse function like `read_tsv()`, we 
would need to tell R what package to find this function in.
To do this, we would use `::` to tell R to load in this function from the 
`readr` package by using `readr::read_tsv()`.
You will see this `::` method of referencing functions within packages 
throughout the course. 
We like to use it in part to remove any ambiguity in which version of a 
function we are using; it is not too uncommon for different packages to use the 
same name for very different functions!

## Managing directories

Before we can import the data we need, we should double check where R is 
looking for files, aka the current **working directory**. 
We can do this by using the `getwd()` function, which will tell us what folder
we are in. 

```{r workingdir, live = TRUE}
# Let's check what directory we are in:
getwd()
```

For Rmd files, the working directory is wherever the file is located, but 
commands executed in the console may have a different working directory.

We will want to make a directory for our output and we will call this directory: 
`results`. 
But before we create the directory, we should check if it already exists. 
We will show two ways that we can do this. 

First, we can use the `dir()` function to have R list the files in our working 
directory. 

```{r}
# Let's check what files are here
dir()
```

If "results" does not appear in this list, there is no folder with that name yet. 

If we want to more pointedly look for "results" in our working directory we can 
use the `dir.exists()` function.

```{r check-dir, live = TRUE}
# Check if the results directory exists
dir.exists("results")
```

If the above says `FALSE` that means we will need to create a `results` 
directory using the function `dir.create()`.

```{r create-dir, live = TRUE}
# Make a directory within the working directory called 'results'
dir.create("results")
```

After creating the results directory above, let's re-run `dir.exists()` to see 
if now it exists.

```{r check-dir-again, live = TRUE}
# Re-check if the results directory exists
dir.exists("results")
```

We can use the output of `dir.exists()` to automatically create or hold off on 
creating a directory by putting this together in an `if` statement like below. 
An `if` statement has two main parts:
First, the test, which is an expression that will result in either `TRUE` or `FALSE`.
This is put in parentheses immediately after the `if`.
The next part is the body, which is the commands that will be executed *if* the 
test is `TRUE`.
These are placed within a set of braces `{ }`.
Note that we used an exclamation point in the test to signify that we want a 
directory to be created only *if* `dir.exists("results")` is NOT equal to `TRUE`.

```{r create-if}
# If 'results' directory doesn't exist...
if (!dir.exists("results")) {
  # ... create a 'results' directory
  dir.create("results")
}
```

The `dir.exists()` function will not work on files themselves.
In that case, there is an analogous function called `file.exists()`.
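
To see the difference between the two functions, here is a minimal sketch that works inside R's temporary directory so that nothing in our project folder is touched.
The folder name `exists_demo` and file name `example.txt` are hypothetical names chosen only for this example.

```r
# Build a path to a scratch folder inside R's temporary directory
tmp_dir <- file.path(tempdir(), "exists_demo")  # hypothetical folder name

# Create the folder only if it is not already there
if (!dir.exists(tmp_dir)) {
  dir.create(tmp_dir)
}

# Write a small text file inside that folder
tmp_file <- file.path(tmp_dir, "example.txt")   # hypothetical file name
writeLines("hello", tmp_file)

dir.exists(tmp_dir)    # TRUE: this path is a directory
dir.exists(tmp_file)   # FALSE: this path is a file, not a directory
file.exists(tmp_file)  # TRUE: the file exists
```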

Try using the `file.exists()` function to see if the file 
`gene_results_GSE44971.tsv` exists in the current directory.
Use the code chunk we set up for you below. 
Note that in our notebooks (and sometimes elsewhere), wherever you see a 
`<FILL_IN_THE_BLANK>` like in the chunk below, that is meant for you to replace 
(including the angle brackets) with the correct phrase before you run the chunk 
(otherwise you will get an error).

```{r file-check, eval=FALSE}
# Replace the <PUT_FILE_NAME_HERE> with the name of the file you are looking for
# Remember to use quotes to make it a character string
file.exists(<PUT_FILE_NAME_HERE>)
```

Now that we've determined that `gene_results_GSE44971.tsv` exists, we are ready 
to read it into our R environment.

#### Read a TSV file

Declare the name of the directory from which we will read in the data. 

```{r}
data_dir <- "data"
```

Although base R has functions to read in data files, the functions in the 
`readr` package (part of the tidyverse) are faster and more straightforward 
to use so we are going to use those here. 
Because the file we are reading in is a TSV (tab separated values) file we will 
be using the `read_tsv()` function. 
There are analogous functions for CSV (comma separated values) files 
(`read_csv()`) and other file types.
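
If it helps to see how the two formats differ, the sketch below writes the same tiny made-up table to temporary files, once with tabs and once with commas, and reads each back with the matching function.
The table contents and the use of `tempfile()` are assumptions for illustration; they have nothing to do with our real data files.

```r
# Write the same made-up table as tab-separated and comma-separated text
tsv_path <- tempfile(fileext = ".tsv")
csv_path <- tempfile(fileext = ".csv")
writeLines(c("gene\texpression", "A\t1.5", "B\t2.5"), tsv_path)
writeLines(c("gene,expression", "A,1.5", "B,2.5"), csv_path)

# Read each file with the function that matches its separator
from_tsv <- readr::read_tsv(tsv_path)
from_csv <- readr::read_csv(csv_path)

# Both files hold the same values once they are read in
identical(from_tsv$gene, from_csv$gene)
identical(from_tsv$expression, from_csv$expression)
```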

## Read in the differential expression analysis results file

```{r read-results}
stats_df <- readr::read_tsv(
  file.path(data_dir,
            "gene_results_GSE44971.tsv")
  )
```

Following the template of the previous chunk, use this chunk to read in the file
`GSE44971.tsv` that is in the `data` folder and save it in the variable `gene_df`. 

```{r read-expr, live = TRUE}
# Use this chunk to read in data from the file `GSE44971.tsv`
gene_df <- readr::read_tsv(
  file.path(data_dir,
            "GSE44971.tsv")
  )
```

Use this chunk to explore what `gene_df` looks like. 

```{r explore}
# Explore `gene_df`

```

What information is contained in `gene_df`?

## dplyr pipes

One nifty feature of the tidyverse is pipes: `%>%`.
These handy things allow you to funnel the result of one expression to the next,
making your code a little more streamlined.

For example, the output from this:  

```{r filter}
filter(stats_df, contrast == "male_female")
```  

...is the same as the output from this:  

```{r filter-pipe}
stats_df %>% filter(contrast == "male_female")
```  
  
This can make your code cleaner and make a series of related commands easier 
to follow. 
Let's look at an example with our stats of how the same 
functions look with or without pipes:

*Example 1:* without pipes: 

```{r steps-nopipe}
stats_arranged <- arrange(stats_df, t_statistic)
stats_filtered <- filter(stats_arranged, avg_expression > 50)
stats_nopipe <- select(stats_filtered, contrast, log_fold_change, p_value)
```
  
UGH, we have to keep track of all of those different intermediate data frames 
and type their names so many times here! 
We could maybe streamline things by using the same variable name at each stage, 
but even then there is a lot of extra typing, and it is easy to get confused 
about what has been done where.
It's annoying and makes it harder for people to read. 
  
*Example 2:* Same result as 1 but with pipes!

```{r steps-pipe, live = TRUE}
# Example of the same modifications as above but with pipes!
stats_pipe  <- stats_df %>%
               arrange(t_statistic) %>%
               filter(avg_expression > 50) %>%
               select(contrast, log_fold_change, p_value)
```

What the `%>%` (pipe) is doing here is feeding the result of the expression on 
its left into the first argument of the next function (to its right, or on the 
next line here). 
We can then skip that first argument (the data in these cases), and move right 
on to the part we care about at that step: what we are arranging, filtering, or 
selecting  in this case.
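
The same idea works with any function, not only `dplyr` ones.
Here is a minimal sketch with a made-up numeric vector, showing that a piped chain gives the same result as the equivalent nested calls.

```r
library(dplyr)  # attaching dplyr makes the %>% pipe available

values <- c(1, 4, 9)  # made-up numbers for illustration

# Nested calls are read from the inside out
nested <- round(sqrt(values), 1)

# Piped calls are read from left to right: values goes into sqrt(),
# and that result becomes the first argument of round()
piped <- values %>%
  sqrt() %>%
  round(1)

identical(nested, piped)  # TRUE
```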

Let's double check that these are the same by using the function, `all.equal()`. 

```{r check-pipe}
all.equal(stats_nopipe, stats_pipe)
```

`all.equal()` is letting us know that these two objects are the same. 

Now that hopefully you are convinced that the tidyverse can help you make your 
code neater and easier to use and read, let's go through some of the popular 
tidyverse functions so we can create pipelines like this. 

## Common tidyverse functions

Let's say we wanted to filter our results to particular rows, such as the 
results for a single gene.
In order to do this, we would use the function `filter()` as well as a logic 
statement (usually one that refers to a column or columns in the data frame).

```{r filter-gene}
# Here let's filter stats_df to the gene_symbol "SNCA"
stats_df %>% 
  filter(gene_symbol == "SNCA")
```

We can use `filter()` similarly for numeric statements.  

```{r filter-numeric, live = TRUE}
# Here let's filter the data to rows with average expression values above 50
stats_df %>%
  filter(avg_expression > 50)
```

We can apply multiple filters at once, which will require all of them to be 
satisfied for every row in the results:

```{r filter-2, live = TRUE}
# filter to highly expressed genes with contrast "male_female"
stats_df %>%
  filter(contrast == "male_female", 
         avg_expression > 50)
```

When we are filtering, the `%in%` operator can come in handy if we have multiple
items we would like to match.
Let's take a look at what using `%in%` does.

```{r in-example, eval = FALSE}
genes_of_interest <- c("SNCA", "CDKN1A")
stats_df$gene_symbol %in% genes_of_interest
```

`%in%` returns a logical vector that we can now use in `dplyr::filter()`.

```{r filter-in, live = TRUE}
# filter to genes of interest
stats_df %>% 
  filter(gene_symbol %in% c("SNCA", "CDKN1A"))
```
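
It can be tempting to use `==` for this kind of matching, but it does not test membership.
The sketch below uses a short made-up vector of gene symbols (not our real data) to compare the two.

```r
# A made-up vector of gene symbols for illustration
gene_symbols <- c("SNCA", "TP53", "CDKN1A", "GFAP")
genes_of_interest <- c("SNCA", "CDKN1A")

# %in% asks, for each element on the left, "is it anywhere on the right?"
is_of_interest <- gene_symbols %in% genes_of_interest
is_of_interest                # TRUE FALSE TRUE FALSE
gene_symbols[is_of_interest]  # "SNCA" "CDKN1A"

# == compares element by element, recycling the shorter vector,
# so "CDKN1A" is compared against "SNCA" here and gets missed
gene_symbols == genes_of_interest  # TRUE FALSE FALSE FALSE
```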

Let's return to our earlier `filter()` on contrast and expression and build on it. 
This time, let's keep only some of the columns from the data frame using the 
`select()` function. 
Let's also save this as a new data frame called `stats_filtered_df`.

```{r filter-select, live = TRUE}
# filter to highly expressed "male_female"
# and select gene_symbol, log_fold_change and t_statistic
stats_filtered_df <- stats_df %>%
  filter(contrast == "male_female", 
         avg_expression > 50) %>%
  select(gene_symbol, log_fold_change, t_statistic)
```

Let's say we wanted to arrange this dataset so that the genes are arranged by 
the smallest p values to the largest.
In order to do this, we would use the function `arrange()` as well as the column
we would like to sort by (in this case `p_value`).

```{r arrange}
stats_df %>% 
  arrange(p_value) 
```

What if we want to sort from largest to smallest? 
Like if we want to see the genes with the highest average expression?
We can use the same function, but wrap the column name in the `desc()` 
function; this time we will sort by the `avg_expression` column. 

```{r arrange-desc}
# arrange descending by avg_expression
stats_df %>%
  arrange(desc(avg_expression))
``` 
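
To make the two sort orders easy to compare, here is a small sketch on a made-up three-row data frame (the gene names and values are invented for this example).

```r
library(dplyr)

# A made-up data frame for illustration only
toy_stats <- data.frame(
  gene = c("A", "B", "C"),
  avg_expression = c(20, 75, 50)
)

# Default: smallest values first
sorted_asc <- toy_stats %>%
  arrange(avg_expression)

# With desc(): largest values first
sorted_desc <- toy_stats %>%
  arrange(desc(avg_expression))

sorted_asc$gene   # "A" "C" "B"
sorted_desc$gene  # "B" "C" "A"
```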

What if we would like to create a new column of values?
For that we use the `mutate()` function.

```{r mutate}
stats_df %>% 
  mutate(log10_p_value = -log10(p_value))
```
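
Here is the same kind of transformation sketched on a tiny made-up data frame, where the arithmetic is easy to check by eye.
The column name `neg_log10_p` is a hypothetical name chosen for this example.

```r
library(dplyr)

# Made-up p-values that are exact powers of ten
toy_p <- data.frame(
  gene = c("A", "B", "C"),
  p_value = c(0.1, 0.01, 0.001)
)

# mutate() keeps the existing columns and adds the new one at the end
toy_mutated <- toy_p %>%
  mutate(neg_log10_p = -log10(p_value))

toy_mutated
```

Smaller p-values give larger values in the new column, which is why this transformation is popular for plotting.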

What if we want to obtain summary statistics for a column or columns?
The `summarize()` function allows us to calculate summary statistics for a column. 
Here we will use `summarize()` to obtain a mean log fold change over all the
genes, and its standard deviation.

```{r summarize}
stats_df %>% 
  summarize(mean(log_fold_change),
            sd(log_fold_change))
```

What if we'd like to obtain summary statistics, but calculate them separately 
for various groups?
Conveniently named, there's a function called `group_by()` that seamlessly 
allows us to do this. 
Also note that `group_by()` allows us to group by multiple variables at a time 
if we want to.

```{r summarize-groups, live = TRUE}
stats_summary_df <- stats_df %>%
      group_by(contrast) %>% 
      summarize(mean(log_fold_change),
                sd(log_fold_change))
```

Let's look at a preview of what we made:

```{r}
stats_summary_df
```

Here we have the mean log fold change, and its standard deviation, for each 
contrast we made. 
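
One optional refinement is to give the summary columns names of our own, which makes them easier to refer to later.
The sketch below does this on a made-up data frame; the contrast labels and the column names `mean_lfc` and `sd_lfc` are hypothetical choices for this example.

```r
library(dplyr)

# A made-up data frame with two groups of two rows each
toy_stats <- data.frame(
  contrast = c("a_b", "a_b", "c_d", "c_d"),
  log_fold_change = c(1, 3, -2, 2)
)

# Name each summary column with `new_name = expression`
toy_summary <- toy_stats %>%
  group_by(contrast) %>%
  summarize(mean_lfc = mean(log_fold_change),
            sd_lfc = sd(log_fold_change))

toy_summary
```

The result has one row per group, and the columns are called `mean_lfc` and `sd_lfc` rather than being named after the expressions that produced them.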

## A brief intro to the `apply` family of functions

In base R, the `apply` family of functions can be an alternative method for 
performing transformations across a data frame, matrix or other object structures. 

One member of this family is (shockingly) the function `apply()`, which 
operates on matrices.

A matrix is similar to a data frame in that it is a rectangular table of data,
but it has an additional constraint: 
rather than each column having a type, ALL data in a matrix has the same type.

The first argument to `apply()` is the data object we want to work on.
The second argument specifies whether we are applying the function 
across rows or across columns (1 for rows, 2 for columns).
The third argument is the function we will apply to each row or column of the 
data object.
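
As a quick sketch of how these three arguments fit together, here is `apply()` on a made-up 2 x 3 matrix that is small enough to check by hand.
The row and column names are invented for this example.

```r
# A made-up matrix: 2 genes (rows) by 3 samples (columns)
# matrix() fills column by column, so gene1 is 1, 3, 5 and gene2 is 2, 4, 6
toy_matrix <- matrix(
  1:6,
  nrow = 2,
  dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3"))
)

# Second argument 1: apply mean() to each row, giving one value per gene
row_means <- apply(toy_matrix, 1, mean)
row_means  # gene1 = 3, gene2 = 4

# Second argument 2: apply mean() to each column, giving one value per sample
col_means <- apply(toy_matrix, 2, mean)
col_means  # s1 = 1.5, s2 = 3.5, s3 = 5.5
```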

Remember that `gene_df` is a gene x sample gene expression data frame that has
columns of two different types, character and numeric, but converting it to a
matrix will require us to make them all the same type.
We can still coerce it into a matrix using `as.matrix()`, in which case R will
pick a type that it can convert everything to.
What does it choose?

```{r matrix}
# Coerce `gene_df` into a matrix
gene_matrix <- as.matrix(gene_df)
```

```{r matrix-type, live = TRUE}
# Explore the structure of the `gene_matrix` object
str(gene_matrix)
```

While that worked, it is rare that we want numbers converted to text, so we are
going to select only the columns with numeric values.
We can do this most easily by removing the first column, which contains character
values.

```{r matrix-numeric, live = TRUE}
# Let's save a new matrix object named `gene_num_matrix` containing only
# the numeric values
gene_num_matrix <- as.matrix(gene_df[, -1])

# Explore the structure of the `gene_num_matrix` object
str(gene_num_matrix)
```

Why do we have a `[, -1]` after `gene_df` in the above chunk?

Now that the matrix is all numbers, we can do things like calculate the column
or row statistics using `apply()`.

```{r rowmeans}
# Calculate row means
gene_means <- apply(gene_num_matrix, 1, mean) # Notice we are using 1 here

# How long will `gene_means` be? 
length(gene_means)
```

Now let's investigate the same setup, but use 2 to `apply` over the columns of
our matrix.

```{r colmeans}
# Calculate sample means
sample_means <- apply(gene_num_matrix, 2, mean) # Notice we use 2 here

# How long will `sample_means` be? 
length(sample_means)
```

We can put the gene names back into the numeric matrix object by
assigning them as rownames.

```{r matrix-rownames, live = TRUE}
# Assign the gene names from gene_df$Gene to the `gene_num_matrix` object using
# the `rownames()` function
rownames(gene_num_matrix) <- gene_df$Gene

# Explore the `gene_num_matrix` object
head(gene_num_matrix)
```

Row names like this can be very convenient for keeping matrices organized, but
row names (and column names) can be lost or misordered if you are not careful,
especially during input and output, so treat them with care.

Although the `apply` functions may not be as easy to use as the tidyverse 
functions, for some applications, `apply` methods can be better suited.
In this workshop, we will not delve too deeply into the various other apply 
functions (`tapply()`, `lapply()`, etc.) but you can read more information about 
them [here](https://www.guru99.com/r-apply-sapply-tapply.html).
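
For a first taste of two of those relatives, here is a minimal sketch of `lapply()` and `sapply()`, which apply a function to each element of a list.
The list `sample_values` and its contents are made up for this example.

```r
# A made-up list holding vectors of different lengths
sample_values <- list(s1 = c(1, 2, 3), s2 = c(10, 20))

# lapply() always returns a list, with one element per input element
means_list <- lapply(sample_values, mean)
means_list

# sapply() does the same work but simplifies the result when it can,
# here to a named numeric vector
means_vector <- sapply(sample_values, mean)
means_vector  # s1 = 2, s2 = 15
```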

## The dplyr::join functions

Let's say we have a scenario where we have two data frames that we would like to 
combine. 
Recall that `stats_df` and `gene_df` are data frames that contain information 
about some of the same genes.
The [`dplyr::join` family of functions](https://dplyr.tidyverse.org/reference/join.html) 
are useful for various scenarios of combining data frames. 

For now, we will focus on `inner_join()`, which will combine data frames by only
keeping information about matching rows that are in both data frames.
We need to use the `by` argument to designate what column(s) 
should be used as a key to match the data frames.
In this case we want to match the gene information between the two, so we will 
specify that we want to compare values in the `ensembl_id` column from 
`stats_df` to the `Gene` column from `gene_df`.

```{r inner-join}
stats_df %>% 
  inner_join(gene_df, by = c('ensembl_id' = 'Gene')) 
```
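
To see what "only keeping matching rows" means in practice, here is a sketch with two tiny made-up data frames whose key columns have different names, as in our real data.
The IDs and values are invented for this example.

```r
library(dplyr)

# Made-up stats table: keys ENSG1, ENSG2, ENSG3
toy_stats <- data.frame(
  ensembl_id = c("ENSG1", "ENSG2", "ENSG3"),
  p_value = c(0.01, 0.2, 0.5)
)

# Made-up expression table: keys ENSG2, ENSG3, ENSG4
toy_expr <- data.frame(
  Gene = c("ENSG2", "ENSG3", "ENSG4"),
  sample_1 = c(5, 6, 7)
)

# Match `ensembl_id` in the left table to `Gene` in the right table
joined <- toy_stats %>%
  inner_join(toy_expr, by = c("ensembl_id" = "Gene"))

joined
```

Only ENSG2 and ENSG3 appear in both tables, so only those two rows survive the join, and the key column keeps its name from the left-hand data frame.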

## Save data to files

#### Save to TSV files

Let's write some of the data frames we created to a file.
To do this, we can use the `write_*()` family of functions from the `readr` package. 
The first argument of `write_tsv()` is the data we want to write, and the second 
argument is a character string that describes the path to the new file we would 
like to create.
Remember that we created a `results` directory to put our output in, 
but if we want to save our data to a directory other than our working directory, 
we need to specify this. 
This is what we will use the `file.path()` function for. 
Let's look in a bit more detail what `file.path()` does, by examining the 
results of the function in the examples below.

```{r file-path-quiz}
# Which of these file paths is what we want to use to save our data to the
# results directory we created at the beginning of this notebook?
file.path("docker-install", "stats_summary.tsv")
file.path("results", "stats_summary.tsv")
file.path("stats_summary.tsv", "results")
```

Replace `<NEW_FILE_PATH>` below with the `file.path()` statement from above that 
will successfully save our file to the `results` folder 

```{r eval=FALSE}
# Write our data frame to a TSV file
readr::write_tsv(stats_summary_df, <NEW_FILE_PATH>)
```

Check in your `results` directory to see if your new file has successfully saved.

#### Save to RDS files

For this example we have been working with data frames, which are conveniently 
represented as TSV or CSV tables. 
However, in other situations where we want to save more complicated or very 
large data structures, RDS (R Data Serialized/Single) files may be a better 
option for saving our data.
RDS is R's special file format for holding data exactly as you have it in your 
R environment. 
RDS files can also be compressed, meaning they will take up less space on your 
computer. 
Let's save our data to an RDS file in our `results` folder.
You will need to replace the `.tsv` with `.RDS`, but you can use what we 
determined as our file path for the last chunk as your template. 

```{r eval=FALSE}
# Write your object to an RDS file
readr::write_rds(stats_summary_df, <PUT_CORRECT_FILE_PATH_HERE>)
```

#### Read an RDS file

Now that you have learned the `readr` functions `read_tsv()`, `write_tsv()`, 
and `write_rds()`, what do you suppose the function you will need to read 
your RDS file is called? 
Use that function here to re-import your data in the chunk we set up for you
below.

```{r eval=FALSE}
# Read in your RDS file
reimport_df <- <PUT_FUNCTION_NAME>(file.path("results", "stats_summary.RDS"))
```
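
To illustrate what "exactly as you have it in your R environment" means, here is a sketch of a round trip through an RDS file using the base R functions `saveRDS()` and `readRDS()`.
The data frame is made up, and the file is written to a temporary location rather than to our `results` folder.

```r
# A made-up data frame with a factor column whose level order matters
toy_df <- data.frame(
  group = factor(c("low", "high", "low"), levels = c("low", "high")),
  value = c(1.5, 2.5, 3.5)
)

# Save to a temporary RDS file and read it straight back in
rds_path <- tempfile(fileext = ".RDS")
saveRDS(toy_df, rds_path)
restored <- readRDS(rds_path)

# The restored object matches the original, including the factor levels
identical(toy_df, restored)  # TRUE
levels(restored$group)       # "low" "high"
```

A plain text format like TSV would store `group` as text, so the factor and its level order would have to be rebuilt after reading.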

As is good practice, we will end this session by printing out our session info. 

### Session Info

```{r}
# Print out the versions and packages we are using in this session
sessionInfo()
```
